Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integration of CDK ExhaustiveFragmenter #133

Open
wants to merge 11 commits into
base: production
Choose a base branch
from

Conversation

JonasSchaub
Copy link
Collaborator

No description provided.

@JonasSchaub JonasSchaub added the enhancement New feature or request label Oct 7, 2024
@JonasSchaub JonasSchaub changed the title Initialization of new fragmenter class Integration of CDK ExhaustiveFragmenter Oct 7, 2024
@ToLeWeiss ToLeWeiss marked this pull request as ready for review October 22, 2024 15:18
@JonasSchaub
Copy link
Collaborator Author

JonasSchaub commented Oct 23, 2024

@ToLeWeiss , I fear performance will be a big issue here. I just started a fragmentation with 1000 natural products and it is still running. We need to discuss this, test it more, and maybe document it somewhere.

Final result of my run:

Oct 23, 2024 1:45:35 PM de.unijena.cheminf.mortar.controller.MainViewController lambda$importMoleculeFile$25
INFO: Successfully imported 1000 molecules from file: Picked_COCONUT_subset_1000_mols.sdf; 0 molecules could not be parsed into the internal data model (SMILES code generation failed). See above how many molecules could not be read from the input file at all or produced exceptions while preprocessing.
Oct 23, 2024 1:45:41 PM de.unijena.cheminf.mortar.controller.MainViewController startFragmentation
INFO: Start of method startFragmentation
Oct 23, 2024 1:45:41 PM de.unijena.cheminf.mortar.model.fragmentation.FragmentationService startFragmentation
INFO: Fragmentation "Exhaustive Fragmenter" (Exhaustive Fragmenter) starting. Current memory consumption: 168 MB
Oct 23, 2024 2:23:38 PM de.unijena.cheminf.mortar.model.fragmentation.FragmentationService startFragmentation
INFO: Fragmentation "Exhaustive Fragmenter" (Exhaustive Fragmenter) of 1,000 molecules complete. It took 2,277,137 ms. Current memory consumption: 2,496 MB
Oct 23, 2024 2:23:38 PM de.unijena.cheminf.mortar.model.fragmentation.FragmentationService startSingleFragmentation
INFO: Number of different fragments extracted: 169,490
Oct 23, 2024 2:23:39 PM de.unijena.cheminf.mortar.controller.MainViewController lambda$startFragmentation$44
INFO: End of method startFragmentation after 2277.6163057

And afterwards, it aborted with an out of memory error...
But as you can see, 169,490 fragments were generated from 1,000 natural products in 38 minutes (4 parallel threads and while I was also working on other things on the same laptop).

Copy link
Collaborator Author

@JonasSchaub JonasSchaub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just some minor things in the comments.

if (!this.shouldBePreprocessed(aMolecule)) {
return aMolecule.clone();
}
return aMolecule.clone();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if statement is unnecessary. I guess you wanted to leave room between the if statement and the return statement for actual preprocessing should you need to add it later? If yes, please add a comment, at least. Or simply remove the if.

* <a href="https://cdk.github.io/cdk/latest/docs/api/org/openscience/cdk/fragment/ExhaustiveFragmenter.html">
* exhaustive fragmentation
* </a>
* from the CDK, available for MORTAR.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary comma

/**
* The name of the algorithm used for fragmentation.
*/
private static final String ALGORITHM_NAME = "Exhaustive Fragmenter";
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two should be public.

this.settingNameTooltipTextMap = new HashMap<>(tmpNumberOfSettings,
BasicDefinitions.DEFAULT_HASH_COLLECTION_LOAD_FACTOR);
this.settingNameDisplayNameMap = new HashMap<>(tmpNumberOfSettings,
BasicDefinitions.DEFAULT_HASH_COLLECTION_LOAD_FACTOR);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use CollectionUtil.calculateInitialHashCollectionCapacity() or HashMap.newHashMap().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now that you only have one setting but are initialising the maps with a capacity of 2, ok. But for the future, please use the ways I indicated.

this.settingNameDisplayNameMap = new HashMap<>(tmpNumberOfSettings,
BasicDefinitions.DEFAULT_HASH_COLLECTION_LOAD_FACTOR);
this.cdkEFInstance = new ExhaustiveFragmenter();
this.minimumFragmentSize = new SimpleIntegerProperty(this,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put -Setting at the end of this variable or -Preference. Otherwise, one might expect an integer here.

}
//</editor-fold>
IAtomContainer tmpMoleculeClone = aMolecule.clone();
List<IAtomContainer> tmpFragments = new ArrayList<>(0);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think of a more suiting default initial size, like maybe the number of bonds in the molecule times two?

IAtomContainer tmpMoleculeClone = aMolecule.clone();
List<IAtomContainer> tmpFragments = new ArrayList<>(0);
try {
SmilesParser tmpSmilesParser = new SmilesParser(DefaultChemObjectBuilder.getInstance());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use SilentChemObjectBuilder

SmilesParser tmpSmilesParser = new SmilesParser(DefaultChemObjectBuilder.getInstance());
this.cdkEFInstance.generateFragments(tmpMoleculeClone);
// there is also an option to extract atom containers directly with getFragmentsAsContainers but this oversaturates
// fragments described in this issue https://github.com/cdk/cdk/issues/1119.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good that you put this here but please flag it with TODO:

IAtomContainer tmpOriginalMolecule;
List<IAtomContainer> tmpFragmentList;
CDKExhaustiveFragmenter tmpFragmenter = new CDKExhaustiveFragmenter();
tmpFragmenter.setFragmentSaturationSetting(IMoleculeFragmenter.FragmentSaturationOption.HYDROGEN_SATURATION);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not make much sense, given that you deactivated that feature.


@Override
public FragmentSaturationOption getFragmentSaturationSetting() {
//TODO: there is currently no possibility to implement saturation settings for the exhaustive fragmenter.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe this in more detail here like you did below?
And the way to do it here would be to throw an UnsupportedOperationException in these three methods. But please test whether this impacts running MORTAR. Since you do not add the property to the list of settings, we should be fine.

@JonasSchaub
Copy link
Collaborator Author

When the new fragmenter is added, we also need to describe it in the tutorial: https://docs.google.com/document/d/1dbpSZfdmOYaQqdYTDZj0NfKuy9TMp6T4SF6lCBB7bFo/edit?usp=sharing
@ToLeWeiss could you suggest some edits there? I need to redo some of the pictures, so that they are consistent.

@JonasSchaub
Copy link
Collaborator Author

@ToLeWeiss ready for re-review?

Copy link

sonarcloud bot commented Nov 7, 2024

Quality Gate Failed Quality Gate failed

Failed conditions
10.6% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

@ToLeWeiss
Copy link
Collaborator

Now it is ready for a review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants